Lab Topics

Regression Diagnostics: - Outliers, leverage, and influence - Added variable and marginal model plots

Goals:

After this lab you will be able to: - assess and diagnose the extent to which outlying observations are driving your results - assess the impact of a given variable within a multiple regression model

This lab uses materials by - Angela Dixon - Andrew Bray - Andrew Bray and Mine Cetinkaya-Rundel - Brian Caffo, Jeff Leek, Roger Peng

The linear model:

Model diagnostics

To assess whether the linear model is reliable, we need to check for (1) linearity, (2) nearly normal residuals, and (3) constant variability.

Let’s take a look at the usage of “rail trails,” which are trail systems that are built on old rail grades; we will explore the relationship between temperature and ridership. Load in the RailTrail data set in the mosaicData package:

library(mosaicData)  # provides the RailTrail data set
head(RailTrail)
##   hightemp lowtemp avgtemp spring summer fall cloudcover precip volume
## 1       83      50    66.5      0      1    0        7.6   0.00    501
## 2       73      49    61.0      0      1    0        6.3   0.29    419
## 3       74      52    63.0      1      0    0        7.5   0.32    397
## 4       95      61    78.0      0      1    0        2.6   0.00    385
## 5       44      52    48.0      1      0    0       10.0   0.14    200
## 6       69      54    61.5      1      0    0        6.6   0.02    375
##   weekday
## 1       1
## 2       1
## 3       1
## 4       0
## 5       1
## 6       1
  1. Using a scatterplot, examine the bivariate relationship between ridership (volume) and high temperature that day (hightemp):

  1. What type of relationship do you observe? Does a linear model at least appear worth trying? What type of relationship do you expect?

It certainly appears that the linear model is worth trying. It looks like there might be a bit of a curvilinear relationship at high temperature levels, as ridership appears to go up until about 80 degrees and then decrease again, but we’ll try a linear fit and go from there.

  1. Fit a bivariate regression of volume on hightemp
mod.rider <- lm(volume~hightemp,data=RailTrail)
  1. Describe the estimated relationship between volume of ridership and the high temperature hightemp.
summary(mod.rider)
## 
## Call:
## lm(formula = volume ~ hightemp, data = RailTrail)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -254.562  -57.800    8.737   57.352  314.035 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.079     59.395  -0.288    0.774    
## hightemp       5.702      0.848   6.724 1.71e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 104.2 on 88 degrees of freedom
## Multiple R-squared:  0.3394, Adjusted R-squared:  0.3319 
## F-statistic: 45.21 on 1 and 88 DF,  p-value: 1.705e-09

Each 1 degree increase in daily high temperature is predicted to increase ridership volume by 5.7 people, holding all other variables constant (though obviously there aren’t other variables in this model).

Test whether the conditions for regression appear reasonable:

Condition 1: Linearity

Linearity: You already checked if the relationship between ridership and temperature is linear using a scatterplot. We should also verify this condition with a plot of the residuals vs. temperature.

plot(rstandard(mod.rider) ~ RailTrail$hightemp)
abline(h = 0, lty = 3)
abline(h = 2, lty = 2)
abline(h = -2, lty = 2)

  1. Is there any apparent pattern in the plot? What does this indicate about the linearity of the relationship between ridership and temperature?

As we might have guessed from the basic scatterplot, ridership appears to be somewhat linear in high temperature, but the linear model does not perform well at very high or very low temperatures. Namely, the line consistently overpredicts ridership at high and low temperatures, when in an ideal-fitting model the residuals would be distributed around zero at all levels of temperature.

Condition 2: Nearly Normal Residuals

Nearly normal residuals: To check this condition, we can look at a histogram

hist(mod.rider$residuals,breaks=10)

or a normal probability plot of the residuals.

qqnorm(mod.rider$residuals)
qqline(mod.rider$residuals)  # adds diagonal line to the normal prob plot

  1. How do these look to you?

The histogram of residuals actually looks pretty good. There is a slight right skew, but overall the distribution of residuals does not evidence a major problem. The q-q plot illuminates the problem a little better, since it’s clear that the residuals in either tail do not quite fit the linear model. Recall, though, that q-q plot tails don’t look perfect even for simulated data from a known distribution, so again this is probably not evidence of a major problem. This is a good example of why you need to check for multiple modeling issues, since some diagnostics might look just fine even when others do not.

Condition 3: Constant variability

Constant variability: There are formal tests for heteroskedasticity, but generally you can use plots as a rough heuristic, at least when doing preliminary fitting. Constant variability means that the variability of points around the regression line remains consistent for all values of X. To check this, we again use the plot of residuals ~ independent variable, and also the residuals ~ fitted values plot:

plot(mod.rider$residuals ~ RailTrail$hightemp)
abline(h = 0, lty = 3)  # adds a horizontal dashed line at y = 0

plot(mod.rider$residuals ~ mod.rider$fitted.values)
abline(h = 0, lty = 3)  # adds a horizontal dashed line at y = 0

(they should look just about the same - why?)

  1. Based on the plot, does the constant variability condition appear to be met?

It’s not entirely clear, but it doesn’t look like constant variability is met, since we observe fairly large variability at medium-range temperatures and lower variability at high and low temps. Let’s run a quick Breusch-Pagan test:

library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
bptest(mod.rider)
## 
##  studentized Breusch-Pagan test
## 
## data:  mod.rider
## BP = 4.7222, df = 1, p-value = 0.02977

Hmm, it looks like we have a small heteroskedasticity problem, since \(p < 0.05\). Let’s move on for now, but strategies you might use include transforming a variable (e.g., log), weighted regression, and in this case adding another variable might fix the problem (can you guess what?).

One other thing:

par(mfrow=c(2,2))
plot(mod.rider)

is a shortcut to look at model diagnostics very quickly. How do these results compare with what we did above?

Outliers, leverage, influence

Outliers are points that don’t fit the trend in the rest of the data.

High leverage points have the potential to have an unusually large influence on the fitted model.

Influential points are high leverage points that cause a very different line to be fit than would be with that point removed.
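One quick way to see influence in action (a sketch, reusing the mod.rider fit from above): refit the model without the single point with the largest Cook’s distance and compare the slopes.

```r
# Compare the fitted slope with and without the most influential observation
# (largest Cook's distance). A large change in the slope is the hallmark
# of an influential point.
drop_i <- which.max(cooks.distance(mod.rider))
coef(mod.rider)
coef(lm(volume ~ hightemp, data = RailTrail[-drop_i, ]))
```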

Influence measures

Recall that there are numerous ways of assessing the influence that a given observation has on your model…

  • Do ?influence.measures to see the full suite of influence measures in stats. The measures include
  • rstandard - standardized residuals: residuals divided by their standard deviations
  • rstudent - studentized residuals: like standardized residuals, but the \(i^{th}\) data point is deleted when calculating the standard deviation, so that the residual follows a t distribution
  • hatvalues - measures of leverage
  • dffits - change in the predicted response when the \(i^{th}\) point is deleted in fitting the model.
  • dfbetas - change in individual coefficients when the \(i^{th}\) point is deleted in fitting the model.
  • cooks.distance - overall change in the coefficients when the \(i^{th}\) point is deleted.
  • resid - returns the ordinary residuals
  • resid(fit) / (1 - hatvalues(fit)) where fit is the linear model fit returns the PRESS residuals, i.e. the leave one out cross validation residuals - the difference in the response and the predicted response at data point \(i\), where it was not included in the model fitting.
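For instance (a quick sketch using the mod.rider fit from earlier), the PRESS residuals are just the ordinary residuals inflated by leverage:

```r
# PRESS (leave-one-out) residuals: each residual divided by (1 - leverage)
press <- resid(mod.rider) / (1 - hatvalues(mod.rider))
head(press)
```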

Find Outliers

Standardized Residuals

One of the more common tools is standardized residuals. Standardized residuals are essentially “z-score residuals”, so if you are using a \(95\%\) confidence level, you can take a quick look to see if any standardized residuals are greater than 2 (or 1.96) or less than -2 (-1.96):

par(mfrow=c(1,1))
plot(rstandard(mod.rider)~ RailTrail$hightemp)
abline(h = 2, lty = 3, col = 'red')   # adds a horizontal dashed line at y = 2
abline(h = -2, lty = 3, col = 'red')  # adds a horizontal dashed line at y = -2
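For reference, the standardized residual for observation \(i\) is, in symbols,

\[ r_i = \frac{e_i}{s\sqrt{1 - h_{ii}}} \]

where \(e_i\) is the ordinary residual, \(s\) the residual standard error, and \(h_{ii}\) the leverage of observation \(i\) (this is the same formula computed by hand in the LA example later in the lab).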

  1. How do your standardized residuals look? How many points would you expect to be outside of the dashed lines simply due to random chance (hint: what does the \(95\%\) interval really mean)?

This looks pretty good. We have 90 obs, so we would expect about 4 or 5 observations to have outlying residuals simply due to chance.

Studentized Residuals

Studentized residuals are just like standardized residuals, but we leave out a given point when computing the standard deviation. That way, the very point you are trying to analyze does not factor into its own standardization. In most cases you won’t notice much difference between the two.
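In symbols (the same pieces as the standardized residual, but with the standard error recomputed without observation \(i\)):

\[ t_i = \frac{e_i}{s_{(i)}\sqrt{1 - h_{ii}}} \]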

plot(rstudent(mod.rider)~ RailTrail$hightemp)
abline(h = 2, lty = 3, col = 'red')   # adds a horizontal dashed line at y = 2
abline(h = -2, lty = 3, col = 'red')  # adds a horizontal dashed line at y = -2

  1. Comment on any observed differences between the studentized and standardized residuals.

As you might expect, there is not much difference.

Assess Leverage

We can assess leverage by plotting hatvalues….

hat <- hatvalues(mod.rider)
plot(hat)
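A common rule of thumb (a sketch; the same heuristic reappears in the LA example later) flags points whose hat value exceeds twice the average leverage \(p/n\), where \(p\) is the number of coefficients:

```r
# Flag high-leverage points: hat values above twice the average leverage p/n
p <- length(coef(mod.rider))   # 2 coefficients here: intercept and slope
n <- nrow(RailTrail)
which(hatvalues(mod.rider) > 2 * p / n)
```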

Measure influence

dfbetas and Cook’s Distance are both common ways to measure influence:

dfbetas are the difference in \(\beta_k\) when observation \(i\) is left out of the model. We care about how each point influences \(\beta_1\) in this case, not \(\beta_0\), so we’ll only plot the second column of results.
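In symbols, each dfbeta is the deletion change in a coefficient, scaled by its standard error:

\[ \textrm{dfbetas}_{k(i)} = \frac{\hat{\beta}_k - \hat{\beta}_{k(i)}}{\widehat{SE}(\hat{\beta}_{k(i)})} \]

where \(\hat{\beta}_{k(i)}\) is the estimate of \(\beta_k\) with observation \(i\) deleted.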

modbetas = dfbetas(mod.rider)
plot(modbetas[,2])

Cook’s Distance is an alternative measure: a summary metric that captures the total change in all model parameters due to a given observation. For this reason, there is no absolute standard for what counts as a “large” or “small” Cook’s distance. The most general criterion is that a point is highly influential when \(D_i > 1\). However, the number of observations directly limits how much influence any one point can have, independent of that observation’s values, so \(D_i > 4/n\) is also a common criterion.
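In symbols, Cook’s distance combines the standardized residual and the leverage:

\[ D_i = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}} \]

where \(p\) is the number of coefficients; this is the same by-hand calculation used in the LA example later in the lab.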

modcooks = cooks.distance(mod.rider)
plot(modcooks)
abline(h=4/nrow(RailTrail),col='red')

  1. Comment on the dfbeta and Cook’s Distance results. What do you think might be driving some of these influential points?

It does appear that there are a few influential points that serve to change the coefficient estimate. In particular, there are several observations that pull the coefficient down quite significantly. My hunch is that these are rainy days where nobody wanted to ride their bike.

Combining Leverage, Outliers, and Influence

Bubble Plot

It’s tough to know how useful a bubble plot is, but it’s fun to make!

plot(hatvalues(mod.rider), rstudent(mod.rider), type='n')
cook <- sqrt(cooks.distance(mod.rider))
points(hatvalues(mod.rider), rstudent(mod.rider), cex=10*cook/max(cook))
abline(h=c(-2,0,2), lty=2)
abline(v = c(2, 3) * length(coef(mod.rider)) / nrow(RailTrail), lty = 2)  # 2 and 3 times the average hat value, p/n

What might we do?

robust regression downweights influential data

library(MASS)
mod.rider.robust = rlm(volume~hightemp,data=RailTrail)
summary(mod.rider.robust)
## 
## Call: rlm(formula = volume ~ hightemp, data = RailTrail)
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -253.802  -55.176    8.995   58.812  314.594 
## 
## Coefficients:
##             Value    Std. Error t value 
## (Intercept) -25.3436  56.8986    -0.4454
## hightemp      5.8019   0.8124     7.1421
## 
## Residual standard error: 86.29 on 88 degrees of freedom

You could also delete problem points, but I would strongly recommend avoiding this if at all possible, unless you know with great confidence that a data point is an error and not simply an outlying observation.

library(MASS)
mod.rider.delete = lm(volume~hightemp,data=RailTrail[cooks.distance(mod.rider)<=4/nrow(RailTrail),])
summary(mod.rider.delete)
## 
## Call:
## lm(formula = volume ~ hightemp, data = RailTrail[cooks.distance(mod.rider) <= 
##     4/nrow(RailTrail), ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -262.83  -57.10    3.22   54.76  268.79 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -69.656     57.118  -1.220    0.226    
## hightemp       6.513      0.828   7.866 1.12e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 94.64 on 84 degrees of freedom
## Multiple R-squared:  0.4242, Adjusted R-squared:  0.4173 
## F-statistic: 61.87 on 1 and 84 DF,  p-value: 1.124e-11
  1. Compare your three models (basic, robust, and point-deletion) - how and why do they differ? Also, how many points did you delete in the deletion model?
library(texreg)
## Version:  1.35
## Date:     2015-04-25
## Author:   Philip Leifeld (University of Konstanz)
## 
## Please cite the JSS article in your publications -- see citation("texreg").
screenreg(list(mod.rider,mod.rider.robust,mod.rider.delete))
## 
## ============================================
##              Model 1     Model 2  Model 3   
## --------------------------------------------
## (Intercept)  -17.08      -25.34   -69.66    
##              (59.40)     (56.90)  (57.12)   
## hightemp       5.70 ***    5.80     6.51 ***
##               (0.85)      (0.81)   (0.83)   
## --------------------------------------------
## R^2            0.34                 0.42    
## Adj. R^2       0.33                 0.42    
## Num. obs.     90          90       86       
## RMSE         104.19                94.64    
## ============================================
## *** p < 0.001, ** p < 0.01, * p < 0.05

Fairly similar results, but it looks like we have alleviated the issue where those few (potentially) rainy days pull the coefficient downwards. The deletion model has a MUCH higher coefficient, but again, we probably don’t want this - those data still matter (unless you think it will never rain again or something).

Multiple regression diagnostics

Residual plots are somewhat problematic for multiple regression, because we have many different input variables. Instead, we will use “added variable plots”.

Residual plots, bivariate regression vs multiple regression

In simple linear regression we use residual plots to assess:

  1. Does the mean function appear linear?
  2. Is it reasonable to assume that the errors have constant variance?

Residual plots in multiple regression

We fit the model:

\[ y \sim x_1 + x_2 \]

library(alr3)  # package that accompanies S. Weisberg, Applied Linear Regression, Third Edition, Wiley, 2005
data(caution)
m1 <- lm(y~x1+x2, data = caution)
plot(m1, 1)

If this were a bivariate model, we would conclude that the mean function looks fairly linear but that the errors appear to have increasing variance. However, these are fake data generated from a model with constant variance!

StanRes1 <- rstandard(m1)
plot(caution$x1,StanRes1, ylab="Standardized Residuals")

plot(caution$x2,StanRes1, ylab="Standardized Residuals")

In MLR, in general, you cannot assume that the structure you see in the residuals vs. fitted plot reflects the part of the model that was misspecified.

  • Non-constant variance in the residuals doesn’t necessarily suggest non-constant variance in the errors.
  • Non-linear structures don’t necessarily suggest a non-linear mean function.

The only conclusion you can draw is that something is misspecified.

Diagnostic Plots for Multiple Regression

So now what?

  • Although several types of invalid models can create non-constant variance in the residuals, a valid model will always be structureless.

  • If you can be sure you have a good mean function, then the residual plot is more informative.

  1. Added Variable Plots: Used to assess a variable’s impact net of other model variables
  2. \(Y\) versus \(\hat{Y}\): used to assess whether the mean function is being modeled well.
  3. Marginal model plots: used to assess whether the mean function between each predictor and the response is being modeled well.

Added variable plots

The objective of constructing an added variable plot is to assess how much each variable adds to your model.

Consider some data concerning restaurants in NYC, where we’d like to build the model:

\[ Price \sim Food + Decor + Service + East \]

We can assess the isolated effect of each predictor on the response with a series of simple scatterplots…

nyc <- read.csv("http://andrewpbray.github.io/data/nyc.csv")
par(mfrow=c(2,2))
plot(Price ~ Food, data = nyc)
abline(lsfit(nyc$Food,nyc$Price))
plot(Price ~ Decor, data = nyc)
abline(lsfit(nyc$Decor,nyc$Price))
plot(Price ~ Service, data = nyc)
abline(lsfit(nyc$Service,nyc$Price))
plot(Price ~ East, data = nyc)
abline(lsfit(nyc$East,nyc$Price))

This might be more efficient…

pairs(Price ~ Food + Decor + Service + East, data = nyc)

But this does not provide a way to look at a variable net of other variables. Instead, an added variable plot tells you how much a given predictor \(x_i\) can explain the response after the other predictors have been taken into account. An “av-plot” for predictor \(x_i\) has the residuals from regressing the response on all the other predictors on the vertical axis, and the residuals from regressing \(x_i\) on those same predictors on the horizontal axis:

Making an avplot by hand

First, get the residuals from the model

\[ Price \sim Decor + Service + East \]

resY <- lm(Price ~ Decor + Service + East, data = nyc)$res

Second, get the residuals from the model

\[ Food \sim Decor + Service + East \]

resX <- lm(Food ~ Decor + Service + East, data = nyc)$res

Then plot them against each other…

plot(resY ~ resX)

Making an avplot in 2 seconds…

The car package has an avPlots() function that does this for you…

library(car)
m1 <- lm(Price ~ Food + Decor + Service + East, data = nyc)
avPlot(m1,variable = "Food")

avPlots(m1)

Notice that if we fit a line through the AVP, the slope should look familiar…

AVPm1 <- lm(resY ~ resX)
AVPm1$coef
## (Intercept)        resX 
## 5.07403e-17 1.53812e+00
m1$coef
##   (Intercept)          Food         Decor       Service          East 
## -24.023799670   1.538119941   1.910087113  -0.002727483   2.068050156

How to use AVPs

  1. AVPs can be used to assess whether it makes sense to include an additional variable in the model (similar to looking at the p-value of the predictor).

  2. They’re a bit more informative, though, since they would also indicate if the relationship between that predictor and the response is linear in the context of the other variables.

Multiple Regression Practice

Let’s look at home prices in LA…

LA <- read.csv("http://andrewpbray.github.io/data/LA.csv")
plot(price ~ sqft, data = LA, col = "steelblue")

In the data set LA, this scatterplot suggests two influential points, but are they influential in a multiple regression model?

  1. Fit the model \(\hat{price} \sim sqft + bed + city\). By the rules of thumb, are those two points high leverage? Outliers? (you can extract the hat values using influence(mod.la)$hat.)
mod.la = lm(price ~ sqft + bed + city,data=LA)
summary(mod.la)
## 
## Call:
## lm(formula = price ~ sqft + bed + city, data = LA)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -8563518  -293754   170468   512647 45112021 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.284e+06  2.016e+05  -6.368  2.5e-10 ***
## sqft              1.867e+03  3.795e+01  49.207  < 2e-16 ***
## bed              -6.165e+05  5.376e+04 -11.467  < 2e-16 ***
## cityLong Beach    6.662e+05  1.582e+05   4.210  2.7e-05 ***
## citySanta Monica  7.384e+05  1.906e+05   3.874 0.000112 ***
## cityWestwood      5.494e+05  2.338e+05   2.349 0.018924 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1775000 on 1588 degrees of freedom
## Multiple R-squared:  0.7536, Adjusted R-squared:  0.7528 
## F-statistic: 971.3 on 5 and 1588 DF,  p-value: < 2.2e-16
hat <- hatvalues(mod.la)
plot(hat)

Definitely something funny going on here.

High leverage?

levs <- influence(mod.la)$hat
hist(levs, breaks = 100)
abline(v = 2 * length(mod.la$coef) / nrow(LA), col = "red")

tail(sort(levs))

High residual?

e_hat <- mod.la$res
s <- sqrt(1/(nrow(LA) - length(mod.la$coef)) * sum(e_hat^2))
r <- e_hat/(s * sqrt(1 - levs))
hist(r)

tail(sort(r))

Yes, very high.

Influence

cdist <- (r^2 / length(mod.la$coef)) * (levs/(1 - levs))
tail(sort(cdist))
plot(mod.la, 5)

Yep, highly influential as well.

  1. Calculate the Cook’s distance of those two observations.
modcooks = cooks.distance(mod.la)
tail(sort(modcooks))
##        1254        1260        1291        1289        1294        1292 
##  0.05408505  0.05847561  0.08015348  0.09034382  4.16812407 23.95308503
  1. Generate the Cook’s distance plot.
plot(modcooks)
abline(h=1,col='red')

table(modcooks > 1)
## 
## FALSE  TRUE 
##  1592     2

Note that we use 1 here instead of 4/nrow(LA) because the sample is much larger. If we were to use 4/nrow(LA), the cutoff would be 0.0025094, and far too many observations would qualify.

AV Plots

library(car)
avPlots(mod.la)

  1. Now fit the more appropriate model, with \(logprice\) and \(logsqft\), and construct added variable plots. What do you learn about the relative usefulness of \(logsqft\) and \(bed\) as predictors?
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:MASS':
## 
##     select
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
LA <- mutate(LA, logprice = log(price), logsqft = log(sqft))
m2 <- lm(logprice ~ logsqft + bed + city, data = LA)
summary(m2)
## 
## Call:
## lm(formula = logprice ~ logsqft + bed + city, data = LA)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.26020 -0.24897 -0.01613  0.21804  1.37277 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       5.13068    0.21201  24.200   <2e-16 ***
## logsqft           1.20729    0.03036  39.769   <2e-16 ***
## bed              -0.03010    0.01284  -2.345   0.0191 *  
## cityLong Beach   -0.88280    0.03467 -25.464   <2e-16 ***
## citySanta Monica -0.09416    0.04022  -2.341   0.0194 *  
## cityWestwood     -0.46244    0.04876  -9.484   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3704 on 1588 degrees of freedom
## Multiple R-squared:  0.8742, Adjusted R-squared:  0.8738 
## F-statistic:  2206 on 5 and 1588 DF,  p-value: < 2.2e-16
avPlots(m2)

Overall, this model should look quite a bit better.

par(mfrow=c(2,2))
plot(m2)

Marginal Model Plots

Example: Defective widgets

defects <- read.table("http://www.stat.tamu.edu/~sheather/book/docs/datasets/defects.txt",
                      header = TRUE)
head(defects)
##   Case Temperature Density  Rate Defective
## 1    1        0.97   32.08 177.7       0.2
## 2    2        2.85   21.14 254.1      47.9
## 3    3        2.95   20.65 272.6      50.9
## 4    4        2.84   22.53 273.4      49.7
## 5    5        1.84   27.43 210.8      11.0
## 6    6        2.05   25.42 236.1      15.6

Let’s look at the pairwise comparisons…

pairs(Defective ~ Temperature + Density + Rate, data = defects)

Here’s our basic model…

\[ \widehat{Defective} \sim Temperature + Density + Rate \]

m1 <- lm(Defective ~ Temperature + Density + Rate, data = defects)
summary(m1)$coef
##               Estimate Std. Error    t value   Pr(>|t|)
## (Intercept) 10.3243900 65.9264798  0.1566046 0.87676619
## Temperature 16.0779089  8.2941050  1.9384742 0.06349394
## Density     -1.8272927  1.4970684 -1.2205807 0.23319949
## Rate         0.1167322  0.1306268  0.8936312 0.37971687

View a summary of model diagnostics:
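A quick option here (a sketch, using the same plot() shortcut shown earlier in the lab, assuming m1 is the defects model from the chunk above):

```r
# Standard diagnostic panel: residuals vs fitted, q-q, scale-location,
# residuals vs leverage
par(mfrow = c(2, 2))
plot(m1)
```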

View residuals plotted against each covariate…

par(mfrow = c(2, 2))
r <- rstandard(m1)
plot(r ~ defects$Temperature)
plot(r ~ defects$Density)
plot(r ~ defects$Rate)

\(Y\) versus \(\hat{Y}\)

Used to assess whether your model is doing a good job of modeling the response. If it is, you’ll see points along the identity line. If it’s not, there will be non-linear structure; try to correct it by transforming the response, and assess on a predictor-by-predictor basis using marginal model plots.
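A sketch of that plot for the defects model (assuming m1 from the chunk above):

```r
# Response vs fitted values; a well-modeled response scatters around the
# identity line
plot(defects$Defective ~ m1$fitted.values,
     xlab = "Fitted values", ylab = "Defective")
abline(0, 1, lty = 2)  # identity line
```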

The standardized residual plots and the plot of \(y\) on \(\hat{y}\) suggest that something is amiss, but what? We need to be sure that the structure in the data is being mirrored well by the structure in our model. This comparison is made for each predictor using the marginal model plot.

Marginal Model Plots are used to assess the marginal relationship between each predictor and the response.

  • It compares the fit from the model with a nonparametric fit to the scatterplot.
  • If your model is well-specified, these two lines will be nearly coincident.

You can build them by hand using loess() or use mmp() in the car package.

Now, load the car package (if you haven’t already) and produce the marginal model plots…

par(mfrow=c(2, 2))
library(car)
mmp(m1, defects$Temperature)
mmp(m1, defects$Density)
mmp(m1, defects$Rate)

Or you can use loess to make these by hand:

plot(Defective ~ Temperature, data = defects)
ord <- order(defects$Temperature)  # keep x and y paired when drawing lines
lines(defects$Temperature[ord], m1$fit[ord], lwd = 2)                         # model fit
l1 <- loess(m1$fit ~ defects$Temperature)
lines(defects$Temperature[ord], l1$fit[ord], lwd = 2, col = "red", lty = 2)   # loess of fit
l2 <- loess(Defective ~ Temperature, data = defects)
lines(defects$Temperature[ord], l2$fit[ord], lwd = 2, col = "blue", lty = 2)  # loess of data

An alternative model

\[ \widehat{\sqrt{Defective}} \sim Temperature + Density + Rate \]

defects <- transform(defects, sqrtDefective = sqrt(Defective))
m2 <- lm(sqrtDefective ~ Temperature + Density + Rate, data = defects)
summary(m2)$coef
##                Estimate Std. Error   t value   Pr(>|t|)
## (Intercept)  5.59297089 5.26400977  1.062492 0.29778175
## Temperature  1.56516337 0.66225665  2.363379 0.02586654
## Density     -0.29166383 0.11953593 -2.439968 0.02181531
## Rate         0.01289857 0.01043012  1.236666 0.22726602

Marginal model plots for second model
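A sketch of those plots, assuming m2 and the car package are loaded as above:

```r
# Marginal model plots for the square-root-response model
par(mfrow = c(2, 2))
mmp(m2, defects$Temperature)
mmp(m2, defects$Density)
mmp(m2, defects$Rate)
```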

Comparing m1 and m2

How do these look?

Recap: MMP vs AVP

  • Marginal model plots: are useful in checking to see that you’re doing a good job of modeling the marginal relationship between a given predictor and the response.
  • Added variable plots: assess how much variation in the response can be explained by a given predictor after the other predictors have already been taken into account (links to p-values).

Goal check?

Questions?